singleton cluster
- Asia > Afghanistan > Parwan Province > Charikar (0.07)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- (2 more...)
Learning-Augmented Streaming Algorithms for Correlation Clustering
Dong, Yinhao, Jiang, Shan, Li, Shi, Peng, Pan
We study streaming algorithms for Correlation Clustering. Given a graph as an arbitrary-order stream of edges, with each edge labeled as positive or negative, the goal is to partition the vertices into disjoint clusters, such that the number of disagreements is minimized. In this paper, we give the first learning-augmented streaming algorithms for the problem on both complete and general graphs, improving the best-known space-approximation tradeoffs. Based on the works of Cambus et al. (SODA'24) and Ahn et al. (ICML'15), our algorithms use the predictions of pairwise distances between vertices provided by a predictor. For complete graphs, our algorithm achieves a better-than-$3$ approximation under good prediction quality, while using $\tilde{O}(n)$ total space. For general graphs, our algorithm achieves an $O(\log |E^-|)$ approximation under good prediction quality using $\tilde{O}(n)$ total space, improving the best-known non-learning algorithm in terms of space efficiency. Experimental results on synthetic and real-world datasets demonstrate the superiority of our proposed algorithms over their non-learning counterparts.
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
- Asia > China > Anhui Province > Hefei (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- (3 more...)
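The objective named in the abstract above (the number of disagreements) can be made concrete with a small sketch. It assumes edges arrive as `(u, v, sign)` triples and that a clustering is a plain vertex-to-cluster-id dictionary; both representations, and the function name `count_disagreements`, are illustrative choices rather than anything taken from the paper.

```python
def count_disagreements(labeled_edges, cluster_of):
    """Count edges whose label conflicts with a given clustering.

    labeled_edges: iterable of (u, v, sign), sign being "+" or "-".
    cluster_of: dict mapping every vertex to a cluster id.
    A "+" edge disagrees when its endpoints are separated;
    a "-" edge disagrees when its endpoints share a cluster.
    """
    disagreements = 0
    for u, v, sign in labeled_edges:
        same_cluster = cluster_of[u] == cluster_of[v]
        if sign == "+" and not same_cluster:
            disagreements += 1
        elif sign == "-" and same_cluster:
            disagreements += 1
    return disagreements


# Hypothetical toy stream: a, b, c are pulled together by "+" edges,
# but a and c are labeled dissimilar, so no clustering is perfect.
edges = [("a", "b", "+"), ("b", "c", "+"), ("a", "c", "-"), ("c", "d", "-")]
one_cluster = {v: 0 for v in "abcd"}
split = {"a": 0, "b": 0, "c": 1, "d": 2}

print(count_disagreements(edges, one_cluster))
print(count_disagreements(edges, split))
```

The function only scores a clustering. Finding a good one within limited space is the hard part that the streaming algorithms address.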
Soft-ECM: An extension of Evidential C-Means for complex data
Soubeiga, Armel, Guyet, Thomas, Antoine, Violaine
Clustering based on belief functions has been gaining increasing attention in the machine learning community due to its ability to effectively represent uncertainty and/or imprecision. However, none of the existing algorithms can be applied to complex data, such as mixed data (numerical and categorical) or non-tabular data like time series. Indeed, these types of data are, in general, not represented in a Euclidean space and the aforementioned algorithms make use of the properties of such spaces, in particular for the construction of barycenters. In this paper, we reformulate the Evidential C-Means (ECM) problem for clustering complex data. We propose a new algorithm, Soft-ECM, which consistently positions the centroids of imprecise clusters requiring only a semi-metric. Our experiments show that Soft-ECM presents results comparable to conventional fuzzy clustering approaches on numerical data, and we demonstrate its ability to handle mixed data and its benefits when combining fuzzy clustering with semi-metrics such as DTW for time series data.
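To illustrate what "requiring only a semi-metric" can mean in practice, here is a minimal sketch of dynamic time warping (DTW) together with a medoid-style representative that needs only pairwise dissimilarities. The medoid is one common workaround when barycenters are unavailable; it is an assumption made for illustration and not necessarily how Soft-ECM positions its centroids. The names `dtw_distance` and `medoid_index` are hypothetical.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matched an earlier b element
                cost[i][j - 1],      # b[j-1] also matched an earlier a element
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]


def medoid_index(items, dissimilarity):
    """Index of the item with the smallest total dissimilarity to the rest.

    Uses only pairwise dissimilarities, so no vector-space averaging is needed.
    """
    totals = [
        sum(dissimilarity(x, y) for j, y in enumerate(items) if j != i)
        for i, x in enumerate(items)
    ]
    return totals.index(min(totals))


series = [[0, 0, 0], [0, 1, 0], [5, 5, 5]]
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # warping absorbs the repeated 2
print(medoid_index(series, dtw_distance))
```

DTW is symmetric and non-negative but does not satisfy the triangle inequality in general, which is why it counts as a semi-metric rather than a metric.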
Enhancing Interpretability of Quantum-Assisted Blockchain Clustering via AI Agent-Based Qualitative Analysis
Tsai, Yun-Cheng, Liu, Yen-Ku, Chen, Samuel Yen-Chi
Blockchain transaction data is inherently high-dimensional, noisy, and entangled, posing substantial challenges for traditional clustering algorithms. While quantum-enhanced clustering models have demonstrated promising performance gains, their interpretability remains limited, restricting their application in sensitive domains such as financial fraud detection and blockchain governance. To address this gap, we propose a two-stage analysis framework that synergistically combines quantitative clustering evaluation with AI Agent-assisted qualitative interpretation. In the first stage, we employ classical clustering methods and evaluation metrics including the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index to determine the optimal cluster count and baseline partition quality. In the second stage, we integrate an AI Agent to generate human-readable, semantic explanations of clustering results, identifying intra-cluster characteristics and inter-cluster relationships. Our experiments reveal that while fully trained Quantum Neural Networks (QNN) outperform random Quantum Features (QF) in quantitative metrics, the AI Agent further uncovers nuanced differences between these methods, notably exposing the singleton cluster phenomenon in QNN-driven models. The consolidated insights from both stages consistently endorse the three-cluster configuration, demonstrating the practical value of our hybrid approach. This work advances the interpretability frontier in quantum-assisted blockchain analytics and lays the groundwork for future autonomous AI-orchestrated clustering frameworks.
- North America > United States > Indiana > Monroe County > Bloomington (0.06)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- (2 more...)
- Banking & Finance (0.46)
- Law Enforcement & Public Safety > Fraud (0.34)
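As a rough illustration of the first-stage quantitative evaluation, the sketch below computes the mean Silhouette Score from its textbook definition using only the standard library. It assumes Euclidean distance and the usual convention that a point in a singleton cluster scores 0; the function name and toy points are hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # well-separated partition
print(silhouette_score(points, [0, 1, 0, 1]))  # partition that mixes the groups
```

Because singleton clusters contribute 0 to the average, a partition with many singletons can look unremarkable on this metric alone, which is one reason a qualitative second look can be informative.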
How to characterize imprecision in multi-view clustering?
Xu, Jinyi, Zhang, Zuowei, Lin, Ze, Chen, Yixiang, Liu, Zhe, Ding, Weiping
It is still challenging to cluster multi-view data since existing methods can only assign an object to a specific (singleton) cluster when combining different view information. As a result, they fail to characterize the imprecision of objects in overlapping regions of different clusters, thus leading to a high risk of errors. In this paper, we therefore seek to answer the question: how to characterize imprecision in multi-view clustering? Correspondingly, we propose a multi-view low-rank evidential c-means based on entropy constraint (MvLRECM). The proposed MvLRECM can be considered as a multi-view version of evidential c-means based on the theory of belief functions. In MvLRECM, each object is allowed to belong to different clusters with various degrees of support (masses of belief) to characterize uncertainty during decision-making. Moreover, if an object is in the overlapping region of several singleton clusters, it can be assigned to a meta-cluster, defined as the union of these singleton clusters, to characterize the local imprecision in the result. In addition, entropy-weighting and low-rank constraints are employed to reduce imprecision and improve accuracy. Compared to state-of-the-art methods, the effectiveness of MvLRECM is demonstrated based on several toy and UCI real datasets.
- Asia > China > Shanghai > Shanghai (0.04)
- Europe > France (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
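The distinction between singleton clusters and meta-clusters can be sketched with a single object's mass function. The example assumes masses are stored in a dictionary keyed by `frozenset` of cluster names, with the empty set standing for the outlier hypothesis; the numbers and names are made up for illustration. The pignistic transformation used here is the standard one from belief-function theory, not a step specific to MvLRECM.

```python
def pignistic(masses):
    """Spread each focal set's mass evenly over its clusters, then renormalize.

    masses: dict mapping frozenset of cluster names to a belief mass.
    The mass on the empty set is excluded and the rest is rescaled.
    """
    empty_mass = masses.get(frozenset(), 0.0)
    if empty_mass >= 1.0:
        raise ValueError("all mass is on the empty set")
    betp = {}
    for focal, mass in masses.items():
        for cluster in focal:  # the empty set contributes nothing here
            betp[cluster] = betp.get(cluster, 0.0) + mass / len(focal)
    return {cluster: p / (1.0 - empty_mass) for cluster, p in betp.items()}


def most_supported_set(masses):
    """Focal set with the largest mass; may be a meta-cluster."""
    return max(masses, key=masses.get)


# Hypothetical object lying between clusters c1 and c2.
m = {
    frozenset({"c1"}): 0.2,
    frozenset({"c2"}): 0.1,
    frozenset({"c1", "c2"}): 0.6,  # meta-cluster: "c1 or c2, cannot tell"
    frozenset(): 0.1,              # outlier mass
}

print(most_supported_set(m))
print(pignistic(m))
```

Here the largest mass sits on the meta-cluster `{c1, c2}`, so the object is reported as imprecise instead of being forced into one singleton cluster. The pignistic probabilities are only needed if a hard decision is eventually required.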
Sifting out communities in large sparse networks
Climer, Sharlee, Smith, Kenneth Jr, Yang, Wei, Fuentes, Lisa de las, Dávila-Román, Victor G., Gu, C. Charles
Research data sets are growing to unprecedented sizes and network modeling is commonly used to extract complex relationships in diverse domains, such as genetic interactions involved in disease, logistics, and social communities. As the number of nodes increases in a network, an increasing sparsity of edges is a practical limitation due to memory restrictions. Moreover, many of these sparse networks exhibit very large numbers of nodes with no adjacent edges, as well as disjoint components of nodes with no edges connecting them. A prevalent aim in network modeling is the identification of clusters, or communities, of nodes that are highly interrelated. Several definitions of strong community structure have been introduced to facilitate this task, each with inherent assumptions and biases. We introduce an intuitive objective function for quantifying the quality of clustering results in large sparse networks. We utilize a two-step method for identifying communities which is especially well-suited for this domain as the first step efficiently divides the network into the disjoint components, while the second step optimizes clustering of the produced components based on the new objective. Using simulated networks, optimization based on the new objective function consistently yields significantly higher accuracy than those based on the modularity function, with the widest gaps appearing for the noisiest networks. Additionally, applications to benchmark problems illustrate the intuitive correctness of our approach. Finally, the practicality of our approach is demonstrated in real-world data in which we identify complex genetic interactions in large-scale networks comprised of tens of thousands of nodes. Based on these three different types of trials, our results clearly demonstrate the usefulness of our two-step procedure and the accuracy of our simple objective.
- North America > United States > Missouri > St. Louis County > St. Louis (0.14)
- North America > United States > Virginia (0.04)
- North America > United States > Maryland (0.04)
- (3 more...)
- Research Report > New Finding (0.66)
- Research Report > Experimental Study (0.48)
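The first step described above, dividing a sparse network into its disjoint components, can be sketched with a union-find structure. This is a generic illustration under the assumption that nodes are numbered `0..num_nodes-1` and edges are given as pairs; it is not the authors' implementation, and the second step (optimizing their new objective within each component) is not reproduced because the abstract does not define it.

```python
def connected_components(num_nodes, edges):
    """Return components as sorted node lists, largest component first."""
    parent = list(range(num_nodes))

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values(), key=lambda g: (-len(g), g))


# Hypothetical sparse network: nodes 5 and 6 have no adjacent edges.
components = connected_components(7, [(0, 1), (1, 2), (3, 4)])
print(components)
```

Nodes without any adjacent edge come out as one-node components, so they can be set aside before any more expensive clustering is run on the larger components.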
Compositional Clustering: Applications to Multi-Label Object Recognition and Speaker Identification
Li, Zeqian, He, Xinlu, Whitehill, Jacob
The goal is not just to partition the data into distinct and coherent groups, but also to infer the compositional relationships among the groups. This scenario arises in speaker diarization (i.e., infer who is speaking when from an audio wave) in the presence of simultaneous speech from multiple speakers [6, 36], which occurs frequently in real-world speech settings: The audio at each time t is generated as a composition of the voices of all the people speaking at time t, and the goal is to cluster the audio samples, over all timesteps, into sets of speakers. Hence, if there are 2 people who sometimes speak by themselves and sometimes speak simultaneously, then the clusters would correspond to the speaker sets {1}, {2}, and {1, 2} - the third cluster is not a third independent speaker, but rather the composition of the first two speakers. An analogous scenario arises in open-world (i.e., test classes are disjoint from training classes) multi-label object recognition when clustering images such that each image may contain multiple objects from a fixed set (e.g., the shapes in Figure 1). In some scenarios, the composition function that specifies how examples are generated from other examples might be as simple as superposition by element-wise maximum or addition. However, a more powerful form of composition - and the main motivation for our work - is enabled by compositional embedding models, which are a new technique for few-shot learning.
- North America > United States (0.04)
- Asia > Middle East > Jordan (0.04)
Single-Pass Pivot Algorithm for Correlation Clustering. Keep it simple!
Chakrabarty, Sayak, Makarychev, Konstantin
We show that a simple single-pass semi-streaming variant of the Pivot algorithm for Correlation Clustering gives a $(3 + \epsilon)$-approximation using $O(n/\epsilon)$ words of memory. This is a slight improvement over the recent results of Cambus, Kuhn, Lindy, Pai, and Uitto, who gave a $(3 + \epsilon)$-approximation using $O(n \log n)$ words of memory, and Behnezhad, Charikar, Ma, and Tan, who gave a 5-approximation using $O(n)$ words of memory. One of the main contributions of this paper is that both the algorithm and its analysis are very simple, and also the algorithm is easy to implement.
- Asia > Afghanistan > Parwan Province > Charikar (0.27)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- (2 more...)
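For readers unfamiliar with Pivot, the sketch below shows the classical offline version on which the streaming variants build: visit vertices in random order, and let each still-unclustered vertex become a pivot that absorbs its unclustered positive neighbors. This version stores all positive edges in memory, so it is not the single-pass semi-streaming variant from the abstract. The function name, the `seed` parameter, and the input format are assumptions made for illustration.

```python
import random


def pivot_clustering(vertices, positive_edges, seed=None):
    """Classical Pivot for Correlation Clustering on a complete graph.

    Pairs not listed in positive_edges are treated as negative.
    Returns a dict mapping each vertex to the pivot that owns its cluster.
    """
    rng = random.Random(seed)
    neighbors = {v: set() for v in vertices}
    for u, v in positive_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    order = list(vertices)
    rng.shuffle(order)  # uniformly random pivot order

    cluster_of = {}
    for pivot in order:
        if pivot in cluster_of:
            continue  # already absorbed by an earlier pivot
        cluster_of[pivot] = pivot
        for nb in neighbors[pivot]:
            if nb not in cluster_of:
                cluster_of[nb] = pivot
    return cluster_of


# Two positive triangles plus vertex 6, which has no positive edges.
verts = range(7)
pos = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
print(pivot_clustering(verts, pos, seed=0))
```

A vertex with no positive neighbors, such as vertex 6 here, always ends up as a singleton cluster. On inputs where the positive edges form disjoint cliques, every pivot order recovers the cliques exactly.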
Clustering US Counties to Find Patterns Related to the COVID-19 Pandemic
Brown, Cora, Milstein, Sarah, Sun, Tianyi, Zhao, Cooper
When COVID-19 first started spreading and quarantine was implemented, the Society for Industrial and Applied Mathematics (SIAM) Student Chapter at the University of Minnesota-Twin Cities began a collaboration with Ecolab to use our skills as data scientists and mathematicians to extract useful insights from relevant data relating to the pandemic. This collaboration consisted of multiple groups working on different projects. In this write-up we focus on using clustering techniques to help us find groups of similar counties in the US and use that to help us understand the pandemic. Our team for this project consisted of University of Minnesota students Cora Brown, Sarah Milstein, Tianyi Sun, and Cooper Zhao, with help from Ecolab Data Scientist Jimmy Broomfield and University of Minnesota student Skye Ke. In the sections below we describe all of the work done for this project. In Section 2, we list the data we gathered, as well as the feature engineering we performed. In Section 3, we describe the metrics we used for evaluating our models. In Section 4, we explain the methods we used for interpreting the results of our various clustering approaches. In Section 5, we describe the different clustering methods we implemented. In Section 6, we present the results of our clustering techniques and provide relevant interpretation. Finally, in Section 7, we provide some concluding remarks comparing the different clustering methods.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Michigan > Wayne County > Wayne (0.04)
- North America > United States > Texas > Dallas County > Dallas (0.04)
- (26 more...)
- Health & Medicine > Epidemiology (0.86)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.63)
- Health & Medicine > Therapeutic Area > Immunology (0.63)